Abstract

This paper tackles the problem of selecting among several linear estimators in non-parametric
regression; this includes model selection for linear regression, the choice of a
regularization parameter in kernel ridge regression or spline smoothing, and the choice of a
kernel in multiple kernel learning. We propose a new algorithm which first consistently
estimates the variance of the noise, based upon the concept of minimal penalty, which was
previously introduced in the context of model selection. Then, plugging our variance estimate
into Mallows' $C_L$ penalty is proved to lead to an algorithm satisfying an oracle inequality.
Simulation experiments with kernel ridge regression and multiple kernel learning show that
the proposed algorithm often significantly improves on existing calibration procedures such as
10-fold cross-validation or generalized cross-validation.
1 Introduction

Kernel-based methods are now well-established tools for supervised learning, allowing one to
perform various tasks, such as regression or binary classification, with linear and non-linear
predictors [19, 18]. A central issue common to all regularization frameworks is the choice of the regularization
parameter: while most practitioners use cross-validation procedures to select such a parameter, 
data-driven procedures not based on cross-validation are rarely used. The choice of the kernel, a 
seemingly unrelated issue, is also important for good predictive performance: several techniques 
exist, either based on cross-validation, Gaussian processes or multiple kernel learning [fj] 117] 15].

In this paper, we consider least-squares regression and cast these two problems as the problem 
of selecting among several linear estimators, where the goal is to choose an estimator with a 
quadratic risk which is as small as possible. This problem includes for instance model selection 
for linear regression, the choice of a regularization parameter in kernel ridge regression or spline 
smoothing, and the choice of a kernel in multiple kernel learning (see Section 2).



* http://www.di.ens.fr/~arlot/
† http://www.di.ens.fr/~fbach/
The main contribution of the paper is to extend the notion of minimal penalty [U [2] to
all discrete classes of linear operators, and to use it for defining a fully data-driven selection
algorithm satisfying a non-asymptotic oracle inequality. Our new theoretical results, presented
in Section 4, extend similar results which were limited to unregularized least-squares regression
(i.e., projection operators). Finally, in Section 5, we show that our algorithm improves on the
performance of classical selection procedures, such as GCV [7] and 10-fold cross-validation, for
kernel ridge regression or multiple kernel learning, for moderate values of the sample size.

2 Linear estimators 

In this section, we define the problem we aim to solve and give several examples of linear 
estimators. 

2.1 Framework and notation 

Let us assume that one observes

\[ Y_i = f(x_i) + \varepsilon_i \in \mathbb{R} \quad \text{for } i = 1, \dots, n , \]

where $\varepsilon_1, \dots, \varepsilon_n$ are i.i.d. centered random variables with
$\mathbb{E}[\varepsilon_i^2] = \sigma^2$ unknown, $f$ is an unknown measurable function
$\mathcal{X} \to \mathbb{R}$ and $x_1, \dots, x_n \in \mathcal{X}$ are deterministic design points.
No assumption is made on the set $\mathcal{X}$. The goal is to reconstruct the signal
$F = (f(x_i))_{1 \le i \le n} \in \mathbb{R}^n$, with some estimator $\hat F \in \mathbb{R}^n$,
depending only on $(x_1, Y_1), \dots, (x_n, Y_n)$, and having a small quadratic risk
$n^{-1} \|\hat F - F\|_2^2$, where $\forall t \in \mathbb{R}^n$, we denote by $\|t\|_2$ the
$\ell_2$-norm of $t$, defined as $\|t\|_2^2 := \sum_{i=1}^n t_i^2$.

In this paper, we focus on linear estimators $\hat F$ that can be written as a linear function of
$Y = (Y_1, \dots, Y_n) \in \mathbb{R}^n$, that is, $\hat F = AY$, for some (deterministic)
$n \times n$ matrix $A$. Here and in the rest of the paper, vectors such as $Y$ or $F$ are assumed
to be column-vectors. We present in Section 2.2 several important families of estimators of this
form. The matrix $A$ may depend on $x_1, \dots, x_n$ (which are known and deterministic), but not
on $Y$, and may be parameterized by certain quantities, usually a regularization parameter or
kernel combination weights.

2.2 Examples of linear estimators 

In this paper, our theoretical results apply to matrices A which are symmetric positive semi- 
definite, such as the ones defined below. 

Ordinary least-squares regression / model selection. If we consider linear predictors
from a design matrix $X \in \mathbb{R}^{n \times p}$, then $\hat F = AY$ with
$A = X (X^\top X)^{-1} X^\top$, which is a projection matrix (i.e., $A^\top A = A$); $\hat F = AY$
is often called a projection estimator. In the variable selection setting, one wants to select a
subset $J \subset \{1, \dots, p\}$, and matrices $A$ are parameterized by $J$.

Kernel ridge regression / spline smoothing. We assume that a positive definite
kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is given, and we are looking for a
function $f : \mathcal{X} \to \mathbb{R}$ in the associated reproducing kernel Hilbert space (RKHS)
$\mathcal{F}$, with norm $\|\cdot\|_{\mathcal{F}}$. If $K$ denotes the $n \times n$ kernel
matrix, defined by $K_{ab} = k(x_a, x_b)$, then the ridge regression estimator (a.k.a. spline
smoothing estimator for spline kernels [22]) is obtained by minimizing with respect to
$f \in \mathcal{F}$ [18]:

\[ \frac{1}{n} \sum_{i=1}^n (Y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{F}}^2 . \]

The unique solution is equal to $\hat f = \sum_{i=1}^n \alpha_i k(\cdot, x_i)$, where
$\alpha = (K + n\lambda I_n)^{-1} Y$. This leads to the smoothing matrix
$A_\lambda = K (K + n\lambda I_n)^{-1}$, parameterized by the regularization parameter
$\lambda \in \mathbb{R}_+$.
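As a hedged illustration (not part of the original text), the degrees of freedom of kernel ridge
regression can be computed without forming the smoothing matrix: if $K$ has eigenvalues
$\mu_1, \dots, \mu_n$, then $K (K + n\lambda I_n)^{-1}$ has eigenvalues $\mu_i / (\mu_i + n\lambda)$.
The sketch below assumes that these eigenvalues are already available; the function names and the
polynomially decaying spectrum are hypothetical choices made for the example.

```python
def krr_shrinkage(eigvals, lam):
    """Eigenvalues of A_lam = K (K + n*lam*I)^{-1}, given the eigenvalues of K."""
    n = len(eigvals)
    return [mu / (mu + n * lam) for mu in eigvals]


def krr_degrees_of_freedom(eigvals, lam):
    """df(lam) = tr(A_lam), i.e. the sum of the shrinkage factors."""
    return sum(krr_shrinkage(eigvals, lam))


# Hypothetical kernel spectrum (an assumption, for illustration only).
eigvals = [1.0 / (j * j) for j in range(1, 51)]
for lam in (1e-6, 1e-3, 1.0):
    print(lam, krr_degrees_of_freedom(eigvals, lam))
```

The degrees of freedom decrease from almost $n$ to almost $0$ as the regularization parameter grows,
which is why a log-scale grid is a natural choice in practice.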

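Multiple kernel learning / Group Lasso / Lasso. We now assume that we have
$p$ different kernels $k_j$, feature spaces $\mathcal{F}_j$ and feature maps
$\Phi_j : \mathcal{X} \to \mathcal{F}_j$, $j = 1, \dots, p$. The
group Lasso [23] and multiple kernel learning [114 [3] frameworks consider the following objective
function

\[ J(f_1, \dots, f_p) = \frac{1}{n} \sum_{i=1}^n \Big( y_i - \sum_{j=1}^p \langle f_j, \Phi_j(x_i) \rangle \Big)^2
   + 2\lambda \sum_{j=1}^p \|f_j\|_{\mathcal{F}_j}
   = L(f_1, \dots, f_p) + 2\lambda \sum_{j=1}^p \|f_j\|_{\mathcal{F}_j} . \]

Note that when $\Phi_j(x)$ is simply the $j$-th coordinate of $x \in \mathbb{R}^p$, we get back the
penalization by the $\ell_1$-norm and thus the regular Lasso [21].

Using $a^{1/2} = \min_{b \ge 0} \frac{1}{2} \{ a/b + b \}$, we obtain a variational formulation of the
sum of norms $2 \sum_{j=1}^p \|f_j\| = \min_{\eta \in \mathbb{R}_+^p} \sum_{j=1}^p \{ \|f_j\|^2 / \eta_j + \eta_j \}$.
Thus, minimizing $J(f_1, \dots, f_p)$ with respect to $(f_1, \dots, f_p)$ is equivalent to minimizing
with respect to $\eta \in \mathbb{R}_+^p$ (see [3] for more details):

\[ \min_{f_1, \dots, f_p} L(f_1, \dots, f_p) + \lambda \sum_{j=1}^p \frac{\|f_j\|^2}{\eta_j} + \lambda \sum_{j=1}^p \eta_j
   = \lambda y^\top \Big( \sum_{j=1}^p \eta_j K_j + n\lambda I_n \Big)^{-1} y + \lambda \sum_{j=1}^p \eta_j , \]

where $I_n$ is the $n \times n$ identity matrix. Moreover, given $\eta$, this leads to a smoothing
matrix of the form

\[ A_{\eta, \lambda} = \Big( \sum_{j=1}^p \eta_j K_j \Big) \Big( \sum_{j=1}^p \eta_j K_j + n\lambda I_n \Big)^{-1} , \tag{1} \]

parameterized by the regularization parameter $\lambda \in \mathbb{R}_+$ and the kernel combinations
in $\mathbb{R}_+^p$; note that it depends only on $\lambda^{-1} \eta$, which can be grouped in a single
parameter set $\mathbb{R}_+^p$.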

Thus, the Lasso/group Lasso can be seen as a particular (convex) way of optimizing over $\eta$.
In this paper, we propose a non-convex alternative with better statistical properties (oracle
inequality in Theorem 1). Note that in our setting, finding the solution of the problem is hard
in general since the optimization is not convex. However, while the model selection problem
is by nature combinatorial, our optimization problems for multiple kernels are all differentiable
and are thus amenable to gradient descent procedures, which only find local optima.

Non-symmetric linear estimators. Other linear estimators are commonly used, such as
nearest-neighbor regression or the Nadaraya-Watson estimator [10]; those however lead to
non-symmetric matrices $A$, and are not entirely covered by our theoretical results.

3 Linear estimator selection 

In this section, we first describe the statistical framework of linear estimator selection and 
introduce the notion of minimal penalty. 

3.1 Unbiased risk estimation heuristics 

Usually, several estimators of the form $\hat F = AY$ can be used. The problem that we consider
in this paper is then to select one of them, that is, to choose a matrix $A$. Let us assume that
a family of matrices $(A_\lambda)_{\lambda \in \Lambda}$ is given (examples are shown in Section 2.2),
hence a family of estimators $(\hat F_\lambda)_{\lambda \in \Lambda}$ can be used, with
$\hat F_\lambda := A_\lambda Y$. The goal is to choose from data some $\hat\lambda \in \Lambda$,
so that the quadratic risk of $\hat F_{\hat\lambda}$ is as small as possible.

The best choice would be the oracle:

\[ \lambda^\star \in \arg\min_{\lambda \in \Lambda} \big\{ n^{-1} \|\hat F_\lambda - F\|_2^2 \big\} , \]

which cannot be used since it depends on the unknown signal $F$. Therefore, the goal is to define
a data-driven $\hat\lambda$ satisfying an oracle inequality

\[ n^{-1} \|\hat F_{\hat\lambda} - F\|_2^2 \le C_n \inf_{\lambda \in \Lambda} \big\{ n^{-1} \|\hat F_\lambda - F\|_2^2 \big\} + R_n , \tag{2} \]

with large probability, where the leading constant $C_n$ should be close to 1 (at least for large $n$)
and the remainder term $R_n$ should be negligible compared to the risk of the oracle.

Many classical selection methods are built upon the "unbiased risk estimation" heuristics:
If $\hat\lambda$ minimizes a criterion $\mathrm{crit}(\lambda)$ such that

\[ \forall \lambda \in \Lambda, \quad \mathbb{E}[\mathrm{crit}(\lambda)] \approx \mathbb{E}\big[ n^{-1} \|\hat F_\lambda - F\|_2^2 \big] , \]

then $\hat\lambda$ satisfies an oracle inequality such as in Eq. (2) with large probability. For
instance, cross-validation [1, 20] and generalized cross-validation (GCV) [7] are built upon this
heuristics.

One way of implementing this heuristics is penalization, which consists in minimizing the
sum of the empirical risk and a penalty term, i.e., using a criterion of the form:

\[ \mathrm{crit}(\lambda) = n^{-1} \|\hat F_\lambda - Y\|_2^2 + \mathrm{pen}(\lambda) . \]



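The unbiased risk estimation heuristics, also called Mallows' heuristics, then leads to the
ideal (deterministic) penalty

\[ \mathrm{pen}_{\mathrm{id}}(\lambda) := \mathbb{E}\big[ n^{-1} \|\hat F_\lambda - F\|_2^2 \big] - \mathbb{E}\big[ n^{-1} \|\hat F_\lambda - Y\|_2^2 \big] . \]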

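When $\hat F_\lambda = A_\lambda Y$, we have:

\[ \|\hat F_\lambda - F\|_2^2 = \|(A_\lambda - I_n) F\|_2^2 + \|A_\lambda \varepsilon\|_2^2 + 2 \langle A_\lambda \varepsilon, (A_\lambda - I_n) F \rangle \tag{3} \]
\[ \|\hat F_\lambda - Y\|_2^2 = \|\hat F_\lambda - F\|_2^2 + \|\varepsilon\|_2^2 - 2 \langle \varepsilon, A_\lambda \varepsilon \rangle + 2 \langle \varepsilon, (I_n - A_\lambda) F \rangle \tag{4} \]

Since $\varepsilon$ is centered with covariance matrix $\sigma^2 I_n$, Eq. (3) and Eq. (4) imply that

\[ \mathrm{pen}_{\mathrm{id}}(\lambda) = \frac{2 \sigma^2 \mathrm{tr}(A_\lambda)}{n} , \tag{5} \]

up to the term $-\mathbb{E}[n^{-1} \|\varepsilon\|_2^2] = -\sigma^2$, which can be dropped since it does
not vary with $\lambda$.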

Note that $\mathrm{df}(\lambda) = \mathrm{tr}(A_\lambda)$ is called the effective dimensionality or
degrees of freedom [23], so that the ideal penalty in Eq. (5) is proportional to the dimensionality
associated with the estimator $A_\lambda$; for projection matrices, we get back the dimension of the
subspace, which is classical in model selection.

The expression of the ideal penalty in Eq. (5) led to several selection procedures, in particular
Mallows' $C_L$ (called $C_p$ in the case of projection estimators) [14], where $\sigma^2$ is replaced by
some estimator $\hat\sigma^2$. The estimator of $\sigma^2$ usually used with $C_L$ is based upon the value
of the empirical risk at some $\lambda_0$ with $\mathrm{df}(\lambda_0)$ large; it has the drawback of
overestimating the risk, in a way which depends on $\lambda_0$ and $F$ [8]. GCV, which implicitly
estimates $\sigma^2$, has the drawback of overfitting if the family $(A_\lambda)_{\lambda \in \Lambda}$
contains a matrix too close to $I_n$ [5]; GCV also overestimates the risk even more than $C_L$ for most
$A_\lambda$ (see (7.9) and Table 4 in [H]).

Figure 1: Bias-variance decomposition of the generalization error, and minimal/optimal penalties.

In this paper, we define an estimator of $\sigma^2$ directly related to the selection task which does
not have similar drawbacks. Our estimator relies on the concept of minimal penalty, introduced
by Birgé and Massart [4] and further studied in [2].



3.2 Minimal and optimal penalties 

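We deduce from Eq. (3) the bias-variance decomposition of the risk:

\[ \mathbb{E}\big[ n^{-1} \|\hat F_\lambda - F\|_2^2 \big]
   = n^{-1} \|(A_\lambda - I_n) F\|_2^2 + \frac{\mathrm{tr}(A_\lambda^\top A_\lambda) \sigma^2}{n}
   = \text{bias} + \text{variance} , \tag{6} \]

and from Eq. (4) the expectation of the empirical risk:

\[ \mathbb{E}\big[ n^{-1} \|\hat F_\lambda - Y\|_2^2 - n^{-1} \|\varepsilon\|_2^2 \big]
   = n^{-1} \|(A_\lambda - I_n) F\|_2^2 - \frac{\big( 2 \mathrm{tr}(A_\lambda) - \mathrm{tr}(A_\lambda^\top A_\lambda) \big) \sigma^2}{n} . \tag{7} \]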

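Note that the variance term in Eq. (6) is not proportional to the effective dimensionality
$\mathrm{df}(\lambda) = \mathrm{tr}(A_\lambda)$ but to $\mathrm{tr}(A_\lambda^\top A_\lambda)$. Although
several papers argue these terms are of the same order (for instance, they are equal when $A_\lambda$
is a projection matrix), this may not hold in general. If $A_\lambda$ is symmetric with a spectrum
$\mathrm{Sp}(A_\lambda) \subset [0, 1]$, as in all the examples of Section 2.2, we only have

\[ 0 \le \mathrm{tr}(A_\lambda^\top A_\lambda) \le \mathrm{tr}(A_\lambda) \le 2 \mathrm{tr}(A_\lambda) - \mathrm{tr}(A_\lambda^\top A_\lambda) \le 2 \mathrm{tr}(A_\lambda) . \tag{8} \]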

In order to give a first intuitive interpretation of Eq. (6) and Eq. (7), let us consider the
kernel ridge regression example and assume that the risk and the empirical risk behave as their
expectations in Eq. (6) and Eq. (7); see also Fig. 1. Completely rigorous arguments based upon
concentration inequalities are developed in the Appendix and summarized in Section 4, leading
to the same conclusion as the present informal reasoning.

First, as proved in Appendix E, the bias $n^{-1} \|(A_\lambda - I_n) F\|_2^2$ is a decreasing function
of the dimensionality $\mathrm{df}(\lambda) = \mathrm{tr}(A_\lambda)$, and the variance
$\mathrm{tr}(A_\lambda^\top A_\lambda) \sigma^2 n^{-1}$ is an increasing function of $\mathrm{df}(\lambda)$,
as well as $2 \mathrm{tr}(A_\lambda) - \mathrm{tr}(A_\lambda^\top A_\lambda)$. Therefore, Eq. (6) shows
that the optimal $\lambda$ realizes the best trade-off between bias (which decreases with
$\mathrm{df}(\lambda)$) and variance (which increases with $\mathrm{df}(\lambda)$), which is a classical
fact in model selection.
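The expectations in Eq. (6) and Eq. (7) and the chain of inequalities in Eq. (8) can be checked
numerically. The following is a minimal sketch, assuming a symmetric smoothing matrix described only
through its eigenvalues (the "shrinkage factors") and a signal $F$ given by its coordinates in the
corresponding eigenbasis; the helper names, the shrinkage profile and the signal are hypothetical.

```python
def traces(shrink):
    """Return (tr(A), tr(A^T A)) for a symmetric A with eigenvalues `shrink`."""
    return sum(shrink), sum(s * s for s in shrink)


def bias(shrink, f_coords):
    """n^{-1} ||(A - I) F||^2, with F expressed in the eigenbasis of A."""
    n = len(shrink)
    return sum(((s - 1.0) * f) ** 2 for s, f in zip(shrink, f_coords)) / n


def expected_risk(shrink, f_coords, sigma2):
    """Right-hand side of Eq. (6): bias + variance."""
    _, tr_a2 = traces(shrink)
    return bias(shrink, f_coords) + tr_a2 * sigma2 / len(shrink)


def expected_empirical_risk(shrink, f_coords, sigma2):
    """E[n^{-1} ||F_hat - Y||^2], i.e. Eq. (7) with sigma^2 added back."""
    tr_a, tr_a2 = traces(shrink)
    n = len(shrink)
    return bias(shrink, f_coords) - (2.0 * tr_a - tr_a2) * sigma2 / n + sigma2


# Hypothetical example: n = 50, kernel-ridge-like shrinkage factors.
n, sigma2 = 50, 1.0
f_coords = [1.0 / j for j in range(1, n + 1)]
for n_lam in (0.001, 0.1, 10.0):
    shrink = [1.0 / (1.0 + n_lam * j * j) for j in range(1, n + 1)]
    tr_a, tr_a2 = traces(shrink)
    print(n_lam, tr_a, bias(shrink, f_coords), tr_a2 * sigma2 / n)
```

The difference between the two expectations equals $2\sigma^2 \mathrm{tr}(A_\lambda)/n - \sigma^2$, which
is the ideal penalty of Eq. (5) up to the constant $-\sigma^2$.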
Second, the expectation of the empirical risk in Eq. (7) can be decomposed into the bias and
a negative variance term which is the opposite of

\[ \mathrm{pen}_{\min}(\lambda) := n^{-1} \big( 2 \mathrm{tr}(A_\lambda) - \mathrm{tr}(A_\lambda^\top A_\lambda) \big) \sigma^2 . \tag{9} \]

As suggested by the notation $\mathrm{pen}_{\min}$, we will show it is a minimal penalty in the
following sense. If

\[ \forall C \ge 0, \quad \hat\lambda_{\min}(C) \in \arg\min_{\lambda \in \Lambda} \big\{ n^{-1} \|\hat F_\lambda - Y\|_2^2 + C \, \mathrm{pen}_{\min}(\lambda) \big\} , \]

then, up to concentration inequalities that are detailed in Section 4.2, $\hat\lambda_{\min}(C)$
behaves like a minimizer of

\[ g_C(\lambda) = \mathbb{E}\big[ n^{-1} \|\hat F_\lambda - Y\|_2^2 + C \, \mathrm{pen}_{\min}(\lambda) \big] - \sigma^2
   = n^{-1} \|(A_\lambda - I_n) F\|_2^2 + (C - 1) \, \mathrm{pen}_{\min}(\lambda) . \]

Therefore, two main cases can be distinguished:

• if $C < 1$, then $g_C(\lambda)$ decreases with $\mathrm{df}(\lambda)$ so that
  $\mathrm{df}(\hat\lambda_{\min}(C))$ is huge: $\hat\lambda_{\min}(C)$ overfits.

• if $C > 1$, then $g_C(\lambda)$ increases with $\mathrm{df}(\lambda)$ when $\mathrm{df}(\lambda)$ is
  large enough, so that $\mathrm{df}(\hat\lambda_{\min}(C))$ is much smaller than when $C < 1$.

As a conclusion, $\mathrm{pen}_{\min}(\lambda)$ is the minimal amount of penalization needed so that a
minimizer $\hat\lambda$ of a penalized criterion is not clearly overfitting.
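The two regimes can be illustrated on the deterministic function $g_C$ alone. The sketch below is only
an illustration under stated assumptions: the family of estimators is described by hypothetical
shrinkage profiles $1/(1 + t j^2)$ on a log-scale grid of $t$, the signal coordinates are taken equal
to $1/j$, and $\sigma^2 = 1$; none of these choices comes from the paper.

```python
def g_c(shrink, f_coords, sigma2, c):
    """g_C = bias + (C - 1) * pen_min, with pen_min as in Eq. (9)."""
    n = len(shrink)
    b = sum(((s - 1.0) * f) ** 2 for s, f in zip(shrink, f_coords)) / n
    pen_min = (2.0 * sum(shrink) - sum(s * s for s in shrink)) * sigma2 / n
    return b + (c - 1.0) * pen_min


def df_at_minimizer(family, f_coords, sigma2, c):
    """Degrees of freedom tr(A) of the member of `family` minimizing g_C."""
    best = min(family, key=lambda shrink: g_c(shrink, f_coords, sigma2, c))
    return sum(best)


# Hypothetical setting: n = 100, signal coordinates 1/j, sigma^2 = 1.
n, sigma2 = 100, 1.0
f_coords = [1.0 / j for j in range(1, n + 1)]
# Family of shrinkage profiles 1 / (1 + t * j^2), with t on a log-scale grid.
family = [[1.0 / (1.0 + 10.0 ** e * j * j) for j in range(1, n + 1)]
          for e in range(-8, 3)]

for c in (0.5, 0.9, 1.1, 2.0):
    print(c, df_at_minimizer(family, f_coords, sigma2, c))
```

For $C < 1$ the minimizer is the least regularized member of the family, whereas for $C > 1$ it moves
to a much smaller dimensionality.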

Following an idea first proposed in [3] and further analyzed or used in several other papers
such as [El O [16], we now propose to use that $\mathrm{pen}_{\min}(\lambda)$ is a minimal penalty for
estimating $\sigma^2$ and plug this estimator into Eq. (5). This leads to the algorithm described in
Section 4.1.

Note that the minimal penalty given by Eq. (9) is new; it generalizes previous results [U [2]
where $\mathrm{pen}_{\min}(\lambda) = n^{-1} \mathrm{tr}(A_\lambda) \sigma^2$ because all $A_\lambda$ were
assumed to be projection matrices, i.e., $A_\lambda^\top A_\lambda = A_\lambda$. Furthermore, our results
generalize the slope heuristics $\mathrm{pen}_{\mathrm{id}} \approx 2 \, \mathrm{pen}_{\min}$ (only
valid for projection estimators [HE]) to general linear estimators for which
$\mathrm{pen}_{\mathrm{id}} / \mathrm{pen}_{\min} \in (1, 2]$.

4 Main results 

In this section, we first describe our algorithm and then present our theoretical results. 
4.1 Algorithm 

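The following algorithm first computes an estimator $\hat C$ of $\sigma^2$ using the minimal penalty in
Eq. (9), then considers the ideal penalty in Eq. (5) for selecting $\lambda$.

Input: $\Lambda$ a finite set with $\mathrm{Card}(\Lambda) \le K n^\alpha$ for some $K, \alpha \ge 0$,
and matrices $A_\lambda$.

• $\forall C > 0$, compute $\hat\lambda_0(C) \in \arg\min_{\lambda \in \Lambda} \{ \|\hat F_\lambda - Y\|_2^2 + C ( 2 \mathrm{tr}(A_\lambda) - \mathrm{tr}(A_\lambda^\top A_\lambda) ) \}$.

• Find $\hat C$ such that $\mathrm{df}(\hat\lambda_0(\hat C)) \in [n^{3/4}, n/10]$.

• Select $\hat\lambda \in \arg\min_{\lambda \in \Lambda} \{ \|\hat F_\lambda - Y\|_2^2 + 2 \hat C \, \mathrm{tr}(A_\lambda) \}$.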
In steps 1 and 2 of the above algorithm, in practice, a grid in log-scale is used, and our
theoretical results from the next section suggest using a step-size of order $n^{-1/4}$. Note that it
may not be possible in all cases to find a $C$ such that
$\mathrm{df}(\hat\lambda_0(C)) \in [n^{3/4}, n/10]$; therefore, our condition in step 2 could be relaxed
to finding a $\hat C$ such that for all $C > \hat C + \delta$, $\mathrm{df}(\hat\lambda_0(C)) < n^{3/4}$
and for all $C < \hat C - \delta$, $\mathrm{df}(\hat\lambda_0(C)) > n/10$, with
$\delta = n^{-1/4 + \xi}$, where $\xi > 0$ is a small constant.

Alternatively, using the same grid in log-scale, we can select $\hat C$ with maximal jump between
successive values of $\mathrm{df}(\hat\lambda_0(C))$; note that our theoretical result then does not
entirely hold, as we show the presence of a jump around $\sigma^2$, but do not show the absence of
similar jumps elsewhere.
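The three steps above, with the maximal-jump variant of step 2, can be sketched as follows. This is a
simplified illustration and not the authors' implementation: it assumes that all matrices $A_\lambda$
are symmetric and share the same eigenvectors (as in kernel ridge regression with a single kernel, but
not in multiple kernel learning), so that each $A_\lambda$ is represented by its eigenvalues and $Y$ by
its coordinates in the common eigenbasis. The function name, the grids and the synthetic data are
hypothetical.

```python
import random


def select_by_minimal_penalty(y_coords, shrink_by_lambda, c_grid):
    """Return (c_hat, index of the selected lambda, df path along c_grid).

    y_coords: coordinates of Y in the common eigenbasis of the A_lambda.
    shrink_by_lambda: for each lambda, the eigenvalues of A_lambda (in [0, 1]).
    c_grid: increasing candidate values for C.
    """
    rss, df, pen_shape = [], [], []
    for shrink in shrink_by_lambda:
        rss.append(sum(((1.0 - s) * y) ** 2 for s, y in zip(shrink, y_coords)))
        tr_a = sum(shrink)
        tr_a2 = sum(s * s for s in shrink)
        df.append(tr_a)
        pen_shape.append(2.0 * tr_a - tr_a2)
    idx = range(len(shrink_by_lambda))

    # Step 1: degrees of freedom of lambda_0(C) for every C on the grid.
    df_path = []
    for c in c_grid:
        k = min(idx, key=lambda j: rss[j] + c * pen_shape[j])
        df_path.append(df[k])

    # Step 2 (variant): C_hat sits right after the largest drop of df.
    jumps = [df_path[i] - df_path[i + 1] for i in range(len(c_grid) - 1)]
    i_star = max(range(len(jumps)), key=jumps.__getitem__)
    c_hat = c_grid[i_star + 1]

    # Step 3: Mallows' C_L with C_hat plugged in as the variance estimate.
    k_hat = min(idx, key=lambda j: rss[j] + 2.0 * c_hat * df[j])
    return c_hat, k_hat, df_path


# Synthetic example (all numerical choices are assumptions).
random.seed(0)
n, sigma = 200, 1.0
f_coords = [5.0 / j for j in range(1, n + 1)]
# In an orthonormal eigenbasis, Gaussian noise keeps i.i.d. N(0, sigma^2) coordinates.
y_coords = [f + random.gauss(0.0, sigma) for f in f_coords]
lambdas = [10.0 ** (-8 + 0.25 * k) for k in range(45)]
shrink_by_lambda = [[1.0 / (1.0 + lam * j * j) for j in range(1, n + 1)]
                    for lam in lambdas]
c_grid = [10.0 ** (-2 + 0.05 * k) for k in range(81)]

c_hat, k_hat, df_path = select_by_minimal_penalty(y_coords, shrink_by_lambda, c_grid)
print("C_hat =", c_hat, "selected lambda =", lambdas[k_hat])
```

As noted in the text, the maximal-jump rule is a practical variant that is not fully covered by the
theory; the window condition of step 2 could be substituted by scanning `df_path` for the first value
inside $[n^{3/4}, n/10]$.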



4.2 Oracle inequality 



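Theorem 1. Let $\hat C$ and $\hat\lambda$ be defined as in the algorithm of Section 4.1, with
$\mathrm{Card}(\Lambda) \le K n^\alpha$ for some $K, \alpha \ge 0$. Assume that
$\forall \lambda \in \Lambda$, $A_\lambda$ is symmetric with $\mathrm{Sp}(A_\lambda) \subset [0, 1]$,
that $\varepsilon_i$ are i.i.d. Gaussian with variance $\sigma^2 > 0$, and that
$\exists \lambda_1, \lambda_2 \in \Lambda$ with

\[ \mathrm{df}(\lambda_1) \ge \frac{n}{2}, \quad \mathrm{df}(\lambda_2) \le \sqrt{n}, \quad \text{and} \quad
   \forall i \in \{1, 2\}, \ n^{-1} \|(A_{\lambda_i} - I_n) F\|_2^2 \le \sigma^2 \sqrt{\frac{\ln(n)}{n}} . \tag{$A_{1-2}$} \]

Then, a numerical constant $C_a$ and an event of probability at least $1 - 8 K n^{-2}$ exist on which,
for every $n \ge C_a$,

\[ \Big( 1 - 91 (\alpha + 2) \sqrt{\frac{\ln(n)}{n}} \Big) \sigma^2 \le \hat C \le
   \Big( 1 + \frac{44 (\alpha + 2) \sqrt{\ln(n)}}{n^{1/4}} \Big) \sigma^2 . \tag{10} \]

Furthermore, if

\[ \exists \kappa \ge 1, \ \forall \lambda \in \Lambda, \quad n^{-1} \mathrm{tr}(A_\lambda) \sigma^2 \le
   \kappa \, \mathbb{E}\big[ n^{-1} \|\hat F_\lambda - F\|_2^2 \big] , \tag{$A_3$} \]

then, a constant $C_b$ depending only on $\kappa$ exists such that for every $n \ge C_b$, on the same
event,

\[ n^{-1} \|\hat F_{\hat\lambda} - F\|_2^2 \le \Big( 1 + \frac{40 \kappa}{\ln(n)} \Big)
   \inf_{\lambda \in \Lambda} \big\{ n^{-1} \|\hat F_\lambda - F\|_2^2 \big\}
   + \frac{36 (\kappa + \alpha + 2) \ln(n) \sigma^2}{n} . \tag{11} \]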

Theorem 1 is proved in the Appendix. The proof mainly follows from the informal arguments
developed in Section 3.2, completed with the following two concentration inequalities: If
$\xi \in \mathbb{R}^n$ is a standard Gaussian random vector, $\alpha \in \mathbb{R}^n$ and $M$ is a
real-valued $n \times n$ matrix, then for every $x \ge 0$,

\[ \mathbb{P}\Big( |\langle \alpha, \xi \rangle| \le \sqrt{2x} \, \|\alpha\|_2 \Big) \ge 1 - 2 e^{-x} \tag{12} \]
\[ \mathbb{P}\Big( \forall \theta > 0, \ \big| \|M \xi\|_2^2 - \mathrm{tr}(M^\top M) \big| \le
   \theta \, \mathrm{tr}(M^\top M) + 2 (1 + \theta^{-1}) \|M\|^2 x \Big) \ge 1 - 2 e^{-x} , \tag{13} \]

where $\|M\|$ is the operator norm of $M$. A proof of Eq. (12) and (13) can be found in Appendix D.
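A quick Monte Carlo experiment gives a feeling for how conservative the first concentration
inequality (for the scalar product of a fixed vector with a standard Gaussian vector) is. This is an
illustrative sketch only: the vector, the level $x$, the number of draws and the function name are
arbitrary assumptions.

```python
import math
import random


def coverage_of_eq_12(alpha, x, n_draws, rng):
    """Fraction of draws with |<alpha, xi>| <= sqrt(2x) * ||alpha||_2."""
    norm = math.sqrt(sum(a * a for a in alpha))
    hits = 0
    for _ in range(n_draws):
        inner = sum(a * rng.gauss(0.0, 1.0) for a in alpha)
        if abs(inner) <= math.sqrt(2.0 * x) * norm:
            hits += 1
    return hits / n_draws


rng = random.Random(0)  # fixed seed, for reproducibility only
alpha = [1.0, -2.0, 0.5, 3.0]  # arbitrary vector (assumption)
x = 2.0
print(coverage_of_eq_12(alpha, x, 2000, rng), ">=", 1.0 - 2.0 * math.exp(-x))
```

Since the scalar product divided by the norm of the vector is a standard Gaussian variable, the
empirical frequency should be close to the probability that a standard Gaussian lies within
$\sqrt{2x}$ of zero, which is larger than the guaranteed lower bound.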



4.3 Discussion of the assumptions of Theorem 1

Gaussian noise. When $\varepsilon$ is sub-Gaussian, Eq. (12) and Eq. (13) can be proved for
$\xi = \sigma^{-1} \varepsilon$ at the price of additional technicalities, which implies that
Theorem 1 is still valid.

Symmetry. The assumption that matrices $A_\lambda$ must be symmetric can certainly be relaxed,
since it is only used for deriving from Eq. (13) a concentration inequality for
$\langle A_\lambda \xi, \xi \rangle$. Note that $\mathrm{Sp}(A_\lambda) \subset [0, 1]$ barely is an
assumption since it means that $A_\lambda$ actually shrinks $Y$.

Assumptions $(A_{1-2})$. $(A_{1-2})$ holds if $\max_{\lambda \in \Lambda} \{ \mathrm{df}(\lambda) \} \ge n/2$
and the bias is smaller than $c \, \mathrm{df}(\lambda)^{-d}$ for some $c, d > 0$, a quite classical
assumption in the context of model selection. Besides, $(A_{1-2})$ is much less restrictive and can
even be relaxed, see Appendix B.

Assumption $(A_3)$. The upper bound $(A_3)$ on $\mathrm{tr}(A_\lambda)$ is certainly the strongest
assumption of Theorem 1, but it is only needed for Eq. (11). According to Eq. (6), $(A_3)$ holds with
$\kappa = 1$ when $A_\lambda$ is a projection matrix since
$\mathrm{tr}(A_\lambda^\top A_\lambda) = \mathrm{tr}(A_\lambda)$. In the kernel ridge regression
framework, $(A_3)$ holds as soon as the eigenvalues of the kernel matrix $K$ decrease like
$j^{-\alpha}$; see Appendix E. In general, $(A_3)$ means that $\hat F_\lambda$ should not have a risk
smaller than the parametric convergence rate associated with a model of dimension
$\mathrm{df}(\lambda) = \mathrm{tr}(A_\lambda)$.

When $(A_3)$ does not hold, selecting among estimators whose risks are below the parametric
rate is a rather difficult problem and it may not be possible to attain the risk of the oracle
in general. Nevertheless, an oracle inequality can still be proved without $(A_3)$, at the price
of enlarging $\hat C$ slightly and adding a small fraction of $\sigma^2 n^{-1} \mathrm{tr}(A_\lambda)$
in the right-hand side of Eq. (11), see Appendix C. Enlarging $\hat C$ is necessary in general: If
$\mathrm{tr}(A_\lambda^\top A_\lambda) \ll \mathrm{tr}(A_\lambda)$ for most $\lambda \in \Lambda$, the
minimal penalty is very close to $2 \sigma^2 n^{-1} \mathrm{tr}(A_\lambda)$, so that according to
Eq. (10), overfitting is likely as soon as $\hat C$ underestimates $\sigma^2$, even by a very small
amount.



4.4 Main consequences of Theorem 1 and comparison with previous results

Consistent estimation of $\sigma^2$. The first part of Theorem 1 shows that $\hat C$ is a consistent
estimator of $\sigma^2$ in a general framework and under mild assumptions. Compared to classical
estimators of $\sigma^2$, such as the one usually used with Mallows' $C_L$, $\hat C$ does not depend on
the choice of some model assumed to have almost no bias, which can lead to overestimating $\sigma^2$ by
an unknown amount [8].

Oracle inequality. Our algorithm satisfies an oracle inequality with high probability, as
shown by Eq. (11): The risk of the selected estimator $\hat F_{\hat\lambda}$ is close to the risk of
the oracle, up to a remainder term which is negligible when the dimensionality
$\mathrm{df}(\lambda^\star)$ grows with $n$ faster than $\ln(n)$, a typical situation when the bias is
never equal to zero, for instance in kernel ridge regression.

Several oracle inequalities have been proved in the statistical literature for Mallows' $C_L$ with
a consistent estimator of $\sigma^2$, for instance in [13]. Nevertheless, except for the model selection
problem (see [1] and references therein), all previous results were asymptotic, meaning that $n$ is
implicitly assumed to be large compared to each parameter of the problem. This assumption
can be problematic for several learning problems, for instance in multiple kernel learning when
the number $p$ of kernels may grow with $n$. On the contrary, Eq. (11) is non-asymptotic, meaning
that it holds for every fixed $n$ as soon as the assumptions explicitly made in Theorem 1 are
satisfied.

Comparison with other procedures. According to Theorem 1 and previous theoretical
results [13, 5], $C_L$, GCV, cross-validation and our algorithm satisfy similar oracle inequalities
in various frameworks. This should not lead to the conclusion that these procedures are
completely equivalent. Indeed, second-order terms can be large for a given $n$, while they are hidden
in asymptotic results and not tightly estimated by non-asymptotic results. As shown by the
simulations in Section 5, our algorithm yields statistical performance as good as existing
methods, and often noticeably better.

Furthermore, our algorithm never overfits too much because $\mathrm{df}(\hat\lambda)$ is by
construction smaller than the effective dimensionality of $\hat\lambda_0(\hat C)$ at which the jump
occurs. This is a quite interesting property compared for instance to GCV, which is likely to overfit
if it is not corrected because GCV minimizes a criterion proportional to the empirical risk.
Figure 2: Selected degrees of freedom vs. penalty strength $\log(C/\sigma^2)$: note that when penalizing
by the minimal penalty, there is a strong jump at $C = \sigma^2$, while when using half the optimal
penalty, this is not the case. Left: single kernel case. Right: multiple kernel case.



5 Simulations 

Throughout this section, we consider exponential kernels on $\mathbb{R}^d$,
$k(x, y) = \prod_{i=1}^d e^{-|x_i - y_i|}$, with the $x$'s sampled i.i.d. from a standard multivariate
Gaussian. The functions $f$ are then selected randomly as $\sum_{i=1}^m \alpha_i k(\cdot, z_i)$, where
both $\alpha$ and $z$ are i.i.d. standard Gaussian (i.e., $f$ belongs to the RKHS).
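The data-generating mechanism just described can be sketched in a few lines. This is a hypothetical
reconstruction of the setup, not the authors' simulation code: the number of centers `m`, the sample
sizes, the noise level and the use of Gaussian noise are assumptions introduced for the example.

```python
import math
import random


def exp_kernel(x, y):
    """k(x, y) = prod_i exp(-|x_i - y_i|) = exp(-sum_i |x_i - y_i|)."""
    return math.exp(-sum(abs(a - b) for a, b in zip(x, y)))


def sample_points(count, d, rng):
    """`count` i.i.d. standard Gaussian points in R^d."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(count)]


def random_rkhs_function(m, d, rng):
    """f = sum_i alpha_i k(., z_i) with alpha_i and z_i standard Gaussian."""
    alphas = [rng.gauss(0.0, 1.0) for _ in range(m)]
    centers = sample_points(m, d, rng)
    return lambda x: sum(a * exp_kernel(x, z) for a, z in zip(alphas, centers))


rng = random.Random(0)
d, n, m, sigma = 4, 5, 10, 1.0  # small hypothetical sizes
xs = sample_points(n, d, rng)
f = random_rkhs_function(m, d, rng)
ys = [f(x) + sigma * rng.gauss(0.0, 1.0) for x in xs]
print(ys)
```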

Jump. In Figure 2 (left), we consider data $x_i \in \mathbb{R}^6$, $n = 1000$, and study the size of
the jump in Figure 2 for kernel ridge regression. With half the optimal penalty (which is used
in traditional variable selection for linear regression), we do not get any jump, while with the
minimal penalty we always do. In Figure 2 (right), we plot the same curves for the multiple kernel
learning problem with two kernels on two different 4-dimensional variables, with similar results.
In addition, we show two ways of optimizing over $\lambda \in \Lambda = \mathbb{R}_+^2$: by discrete
optimization with $n$ different kernel matrices (a situation covered by Theorem 1), or with continuous
optimization with respect to $\eta$ in Eq. (1), by gradient descent (a situation not covered by
Theorem 1).

Comparison of estimator selection methods. In Figure 3, we plot model selection
results for 20 replications of data ($d = 4$, $n = 500$), comparing GCV [7], our minimal penalty
algorithm, and cross-validation methods. In the left part (single kernel), we compare to the
oracle (which can be computed because we can enumerate $\Lambda$), and use for cross-validation all
possible values of $\lambda$. In the right part (multiple kernel), we compare to the performance of
Mallows' $C_L$ when $\sigma^2$ is known (i.e., penalty in Eq. (5)), and since we cannot enumerate all
$\lambda$'s, we use the solution obtained by MKL with CV [3]. We also compare to using our minimal
penalty algorithm with the sum of kernels.



6 Conclusion 

A new light on the slope heuristics. Theorem 1 generalizes some results first proved in
[1] where all $A_\lambda$ are assumed to be projection matrices, a framework where assumption $(A_3)$ is
automatically satisfied. To this extent, Birgé and Massart's slope heuristics has been modified
in a way that sheds a new light on the "magical" factor 2 between the minimal and the optimal
penalty, as proved in [H[2]. Indeed, Theorem 1 shows that for general linear estimators,

\[ \frac{\mathrm{pen}_{\mathrm{id}}(\lambda)}{\mathrm{pen}_{\min}(\lambda)} =
   \frac{2 \mathrm{tr}(A_\lambda)}{2 \mathrm{tr}(A_\lambda) - \mathrm{tr}(A_\lambda^\top A_\lambda)} , \tag{14} \]

which can take any value in $(1, 2]$ in general; this ratio is only equal to 2 when
$\mathrm{tr}(A_\lambda) \approx \mathrm{tr}(A_\lambda^\top A_\lambda)$, hence mostly when $A_\lambda$ is
a projection matrix.

Figure 3: Comparison of various smoothing parameter selection methods (minikernel, GCV, 10-fold
cross-validation) for various numbers of observations, averaged over 20 replications. Left:
single kernel, right: multiple kernels.
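The ratio between the ideal and the minimal penalty only depends on the eigenvalues of the smoothing
matrix, so it is easy to evaluate. The sketch below is an illustration under the assumption of a
symmetric matrix given by its eigenvalues; the function name and the ridge-like profile are
hypothetical.

```python
def penalty_ratio(shrink):
    """pen_id / pen_min = 2 tr(A) / (2 tr(A) - tr(A^T A)) for symmetric A."""
    tr_a = sum(shrink)
    tr_a2 = sum(s * s for s in shrink)
    return 2.0 * tr_a / (2.0 * tr_a - tr_a2)


projection = [1.0] * 3 + [0.0] * 7  # eigenvalues of a rank-3 projection
ridge_like = [1.0 / (1.0 + 0.1 * j * j) for j in range(1, 11)]  # hypothetical

print(penalty_ratio(projection))  # 2.0
print(penalty_ratio(ridge_like))  # strictly between 1 and 2
```

For a projection the ratio is exactly 2, which recovers the usual slope heuristics, while strongly
shrunk eigenvalues push the ratio towards 1.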

Future directions. In the case of projection estimators, the slope heuristics still holds
when the design is random and data are heteroscedastic [2]; we would like to know whether
Eq. (14) is still valid for heteroscedastic data with general linear estimators. In addition, the
good empirical performance of elbow heuristics based algorithms (i.e., based on the sharp
variation of a certain quantity around good hyperparameter values) suggests that Theorem 1 can
be generalized to many learning frameworks (and potentially to non-linear estimators), probably
with small modifications in the algorithm, but always relying on the concept of minimal penalty.

Another interesting open problem would be to extend the results of Section 4, where
$\mathrm{Card}(\Lambda) \le K n^\alpha$ is assumed, to continuous sets $\Lambda$ such as the ones
appearing naturally in kernel ridge regression and multiple kernel learning. We conjecture that
Theorem 1 is valid without modification for a "small" continuous $\Lambda$, such as in kernel ridge
regression where taking a grid of size $n$ in log-scale is almost equivalent to taking
$\Lambda = \mathbb{R}_+$. On the contrary, in applications such as the Lasso with $p \gg n$ variables,
the natural set $\Lambda$ cannot be well covered by a grid of cardinality $n^\alpha$ with $\alpha$
small, and our minimal penalty algorithm and Theorem 1 certainly have to be modified.

Appendix 

This appendix is mainly devoted to the proof of Theorem 1, which is split into two results.
First, Proposition 1 shows that $n^{-1} \sigma^2 ( 2 \mathrm{tr}(A_\lambda) - \mathrm{tr}(A_\lambda^\top A_\lambda) )$
is a minimal penalty, so that $\hat C$ defined in the Algorithm of Section 4.1 consistently estimates
$\sigma^2$. Second, Proposition 2 shows that penalizing the empirical risk with
$2 \hat C \, \mathrm{tr}(A_\lambda) n^{-1}$ and $\hat C \approx \sigma^2$ leads to an oracle inequality.
Proving Theorem 1 is straightforward by combining Propositions 1 and 2.

In Section A, we introduce some notation and make some computations that will be used
in the following. Proposition 1 is proved in Section B. Proposition 2 is proved in Section C.
Concentration inequalities needed for proving Propositions 1 and 2 are stated and proved in
Section D. Computations specific to the kernel ridge regression example are made in Section E.

A Notation and first computations 

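Recall that

\[ Y = F + \varepsilon \]

where $F = (f(x_i))_{1 \le i \le n} \in \mathbb{R}^n$ is deterministic,
$\varepsilon = (\varepsilon_i)_{1 \le i \le n} \in \mathbb{R}^n$ is centered with covariance
matrix $\sigma^2 I_n$ and $I_n$ is the $n \times n$ identity matrix. For every $\lambda \in \Lambda$,
$\hat F_\lambda = A_\lambda Y$ for some $n \times n$ real-valued matrix $A_\lambda$, so that

\[ \|\hat F_\lambda - F\|_2^2 = \|(A_\lambda - I_n) F\|_2^2 + \|A_\lambda \varepsilon\|_2^2 + 2 \langle A_\lambda \varepsilon, (A_\lambda - I_n) F \rangle \tag{15} \]
\[ \|\hat F_\lambda - Y\|_2^2 = \|\hat F_\lambda - F\|_2^2 + \|\varepsilon\|_2^2 - 2 \langle \varepsilon, A_\lambda \varepsilon \rangle + 2 \langle \varepsilon, (I_n - A_\lambda) F \rangle \tag{16} \]

where $\forall t, u \in \mathbb{R}^n$, $\langle t, u \rangle = \sum_{i=1}^n t_i u_i$.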
(16) 



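Now, define, for every $\lambda \in \Lambda$,

\[ b(\lambda) = \|(A_\lambda - I_n) F\|_2^2 \]
\[ v_1(\lambda) = \mathrm{tr}(A_\lambda) \sigma^2 \]
\[ \delta_1(\lambda) = \langle \varepsilon, A_\lambda \varepsilon \rangle - \mathrm{tr}(A_\lambda) \sigma^2 \]
\[ v_2(\lambda) = \mathrm{tr}(A_\lambda^\top A_\lambda) \sigma^2 \]
\[ \delta_2(\lambda) = \|A_\lambda \varepsilon\|_2^2 - \mathrm{tr}(A_\lambda^\top A_\lambda) \sigma^2 \]
\[ \delta_3(\lambda) = 2 \langle A_\lambda \varepsilon, (A_\lambda - I_n) F \rangle \]
\[ \delta_4(\lambda) = 2 \langle \varepsilon, (I_n - A_\lambda) F \rangle , \]

so that Eq. (15) and (16) can be rewritten

\[ \|\hat F_\lambda - F\|_2^2 = b(\lambda) + v_2(\lambda) + \delta_2(\lambda) + \delta_3(\lambda) \tag{17} \]
\[ \|\hat F_\lambda - Y\|_2^2 = \|\hat F_\lambda - F\|_2^2 - 2 v_1(\lambda) - 2 \delta_1(\lambda) + \delta_4(\lambda) + \|\varepsilon\|_2^2 . \tag{18} \]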



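Note that $b(\lambda)$, $v_1(\lambda)$ and $v_2(\lambda)$ are deterministic, and for all
$\lambda \in \Lambda$, all $\delta_i(\lambda)$ are random with zero mean. In particular, we deduce the
following expressions of the risk and the empirical risk of $\hat F_\lambda$:

\[ \mathbb{E}\big[ n^{-1} \|\hat F_\lambda - F\|_2^2 \big] = n^{-1} \|(A_\lambda - I_n) F\|_2^2
   + \frac{\mathrm{tr}(A_\lambda^\top A_\lambda) \sigma^2}{n} \tag{19} \]
\[ \mathbb{E}\big[ n^{-1} \|\hat F_\lambda - Y\|_2^2 \big] - \sigma^2 = n^{-1} \|(A_\lambda - I_n) F\|_2^2
   - \frac{\big( 2 \mathrm{tr}(A_\lambda) - \mathrm{tr}(A_\lambda^\top A_\lambda) \big) \sigma^2}{n} . \tag{20} \]

Define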



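\[ \|A_\lambda\| := \max \mathrm{Sp}(A_\lambda) = \sup_{t \in \mathbb{R}^n, \, t \neq 0} \frac{\|A_\lambda t\|_2}{\|t\|_2} . \]

Since $\mathrm{tr}(A_\lambda) \le \sqrt{n \, \mathrm{tr}(A_\lambda^\top A_\lambda)}$, we have

\[ v_1(\lambda) \le \sigma \sqrt{n \, v_2(\lambda)} . \tag{21} \]

In addition, if $A_\lambda$ has a spectrum $\mathrm{Sp}(A_\lambda) \subset [0, 1]$, then

\[ 0 \le \mathrm{tr}(A_\lambda^\top A_\lambda) \le \mathrm{tr}(A_\lambda) \le 2 \mathrm{tr}(A_\lambda) - \mathrm{tr}(A_\lambda^\top A_\lambda) \le 2 \mathrm{tr}(A_\lambda) \]

so that

\[ 0 \le v_2(\lambda) \le v_1(\lambda) \le 2 v_1(\lambda) - v_2(\lambda) \le 2 v_1(\lambda) . \tag{22} \]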
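B Minimal penalty

Define

\[ \forall C \ge 0, \quad \hat\lambda_0(C) \in \arg\min_{\lambda \in \Lambda} \big\{ \|\hat F_\lambda - Y\|_2^2
   + C \big( 2 \mathrm{tr}(A_\lambda) - \mathrm{tr}(A_\lambda^\top A_\lambda) \big) \big\} . \tag{23} \]

We will prove the following proposition in this section.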

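Proposition 1. Let $\hat\lambda_0$ be defined by Eq. (23). Assume that $\forall \lambda \in \Lambda$,
$A_\lambda$ is symmetric with $\mathrm{Sp}(A_\lambda) \subset [0, 1]$, that $\varepsilon_i$ are i.i.d.
Gaussian with zero mean and variance $\sigma^2 > 0$, and that

\[ \exists \lambda_1 \in \Lambda, \quad \mathrm{df}(\lambda_1) \ge \frac{n}{2} \quad \text{and} \quad b(\lambda_1) \le \sigma^2 \sqrt{n \ln(n)} \tag{$A_1$} \]
\[ \exists \lambda_2 \in \Lambda, \quad \mathrm{df}(\lambda_2) \le \sqrt{n} \quad \text{and} \quad b(\lambda_2) \le \sigma^2 \sqrt{n \ln(n)} . \tag{$A_2$} \]

Then, a numerical constant $C_1 > 0$ exists such that for every $n \ge C_1$, for every $\gamma \ge 1$,

\[ \forall \, 0 \le C < \Big( 1 - 91 \gamma \sqrt{\frac{\ln(n)}{n}} \Big) \sigma^2 , \quad \mathrm{df}(\hat\lambda_0(C)) \ge \frac{n}{10} \tag{24} \]
\[ \text{and} \quad \forall C > \Big( 1 + \frac{44 \gamma \sqrt{\ln(n)}}{n^{1/4}} \Big) \sigma^2 , \quad \mathrm{df}(\hat\lambda_0(C)) \le n^{3/4} \tag{25} \]

hold with probability at least $1 - 8 \, \mathrm{Card}(\Lambda) n^{-\gamma}$.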

If $\mathrm{Card}(\Lambda) \le K n^\alpha$, Proposition 1 with $\gamma = \alpha + 2$ proves that with
probability at least $1 - 8 K n^{-2}$, $\hat C$ defined in the Algorithm of Section 4.1 exists and

\[ \Big( 1 - 91 (\alpha + 2) \sqrt{\frac{\ln(n)}{n}} \Big) \sigma^2 \le \hat C \le
   \Big( 1 + \frac{44 (\alpha + 2) \sqrt{\ln(n)}}{n^{1/4}} \Big) \sigma^2 . \]



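Remark 1. If $(A_1)$ is replaced by

\[ \exists \lambda_1 \in \Lambda, \quad \mathrm{df}(\lambda_1) \ge \alpha_n \quad \text{and} \quad b(\lambda_1) \le \sigma^2 \beta_n \tag{$A'_1$} \]

for some $\alpha_n \ge \ln(n)$ and $\beta_n \ge 0$, then Proposition 1 still holds with Eq. (24) replaced by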



VO < C < 1 



3& 



6271 



'ln(n) 



(T 



df(A (C)) > 



(26) 



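Remark 2. If $(A_2)$ is replaced by

\[ \exists \lambda_2 \in \Lambda, \quad \mathrm{df}(\lambda_2) \le n^\alpha \quad \text{and} \quad b(\lambda_2) \le \sigma^2 \beta_n \tag{$A'_2$} \]

for some $\alpha \in [1/2, 1)$ and $\beta_n \ge \sqrt{n \ln(n)}$, then for every $\beta \in (\alpha, 1)$
Proposition 1 still holds for $n \ge \max\{ C_1, 4^{1/(\beta - \alpha)} \}$ with Eq. (25) replaced by

\[ \forall C > \big( 1 + 44 \gamma \beta_n n^{-\beta} \big) \sigma^2 , \quad \mathrm{df}(\hat\lambda_0(C)) \le n^\beta . \tag{27} \]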

Remark 3. On the event defined in Proposition 1, we can derive from Eq. (17), (53), (62), and
$\|A_\lambda\| \le 1$, that



VA G A suc/i i/iai df(A) > 



n 



ln(n) 



n 



F\ — F 



> 



1 



87 ln(n) 



21n(n) 



n 



a 



Hence, the blow up of $\mathrm{df}(\hat\lambda_0(C))$ holding when the penalty is below the minimal
penalty also implies a blow up of the risk $n^{-1} \|\hat F_{\hat\lambda_0(C)} - F\|_2^2$.

Let us now prove Proposition 1.


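B.1 General starting point

Combining Eq. (23) with Eq. (17) and (18), for every $C \ge 0$, $\widehat{\lambda}_0(C)$ also minimizes over $\lambda \in \Lambda$

\[ \begin{aligned} \mathrm{crit}_C(\lambda) &:= \|\widehat{F}_\lambda - Y\|_2^2 - \|\varepsilon\|_2^2 + C\left(2\,\mathrm{tr}(A_\lambda) - \mathrm{tr}(A_\lambda^\top A_\lambda)\right) \\ &= b(\lambda) + (\sigma^{-2}C - 1)\left(2v_1(\lambda) - v_2(\lambda)\right) - 2\delta_1(\lambda) + \delta_2(\lambda) + \delta_3(\lambda) + \delta_4(\lambda) . \end{aligned} \]

We now use the concentration inequalities of Eq. (53), (54), (61) and (62) proved in Section D. For every $\lambda \in \Lambda$ and $x \ge 1$, an event of probability at least $1 - 8e^{-x}$ exists on which for every $C \ge 0$ and $\theta > 0$,

\[ \mathrm{crit}_C(\lambda) \ge \frac{b(\lambda)}{3} + (\sigma^{-2}C - 1)\left(2v_1(\lambda) - v_2(\lambda)\right) - 3\theta v_1(\lambda) - 6(2 + \theta^{-1})x\sigma^2 \tag{28} \]

\[ \mathrm{crit}_C(\lambda) \le \frac{5b(\lambda)}{3} + (\sigma^{-2}C - 1)\left(2v_1(\lambda) - v_2(\lambda)\right) + 3\theta v_1(\lambda) + 6(2 + \theta^{-1})x\sigma^2 , \tag{29} \]

using also that $v_2 \le v_1$ by Eq. (22) and that $\|A_\lambda\| \le 1$.

For every $x \ge 1$, let $\Omega_x$ be the event on which the inequalities appearing in Eq. (53), (54), (61) and (62) hold for every $\theta > 0$ and $\lambda \in \Lambda$. The union bound shows that $P(\Omega_x) \ge 1 - 8\,\mathrm{Card}(\Lambda)e^{-x}$.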
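To make the role of $\mathrm{crit}_C$ concrete, the following sketch computes the minimizer $\widehat{\lambda}_0(C)$ on a toy problem and reports $\mathrm{df}(\widehat{\lambda}_0(C))$ for several values of $C$. Everything about the toy problem is an assumption made for illustration (the eigenvalues $\mu_j = j^{-2}$, the signal coordinates, the grid of $\lambda$, the seed, and all function names); the estimators are the diagonal shrinkage matrices of the kernel ridge regression example, written directly in the eigenbasis of $K$. Since $\|\varepsilon\|_2^2$ does not depend on $\lambda$, it is dropped from the criterion.

```python
import math
import random

def make_toy_problem(n=200, sigma=1.0, seed=0):
    """Hypothetical problem written in the eigenbasis of K: returns (mu, y)."""
    rng = random.Random(seed)
    mu = [j ** -2.0 for j in range(1, n + 1)]                # assumed eigenvalues of K
    f = [math.sqrt(n) * j ** -1.5 for j in range(1, n + 1)]  # assumed coordinates of F
    y = [fj + sigma * rng.gauss(0.0, 1.0) for fj in f]       # Y = F + Gaussian noise
    return mu, y

def shrinkage(mu, lam):
    """Diagonal coefficients mu_j / (mu_j + n*lam) of A_lambda."""
    n = len(mu)
    return [m / (m + n * lam) for m in mu]

def crit(mu, y, lam, C):
    """||F_hat - Y||^2 + C * (2 tr(A) - tr(A^T A)); the term -||eps||^2 is omitted."""
    a = shrinkage(mu, lam)
    residual = sum(((1.0 - aj) * yj) ** 2 for aj, yj in zip(a, y))
    penalty_shape = sum(2.0 * aj - aj * aj for aj in a)
    return residual + C * penalty_shape

def df_selected(mu, y, lambdas, C):
    """Degrees of freedom tr(A_lambda) at the minimizer of crit over the grid."""
    best = min(lambdas, key=lambda lam: crit(mu, y, lam, C))
    return sum(shrinkage(mu, best))

mu, y = make_toy_problem()
lambdas = [10.0 ** (-k / 4.0) for k in range(0, 49)]  # grid from 1 down to 1e-12
for C in (0.0, 0.5, 0.9, 1.1, 2.0, 4.0):              # the noise variance is 1 here
    print(f"C = {C:3.1f}   df = {df_selected(mu, y, lambdas, C):7.2f}")
```

The selected degrees of freedom can only decrease as $C$ increases, because the penalty shape $2\,\mathrm{tr}(A_\lambda) - \mathrm{tr}(A_\lambda^\top A_\lambda)$ is increasing in $\mathrm{df}(\lambda)$; where the drop happens, and how sharp it is, depends on the toy problem.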

B.2 Below the minimal penalty 



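We assume in this subsection that $C \in [0, \sigma^2)$. We will prove Eq. (26) using assumption (A'_1), since when $a_n = n/2$ and $\beta_n = \sqrt{n \ln(n)}$, Eq. (26) is Eq. (24) and (A'_1) is (A_1).

Using Eq. (22) and taking $\theta = \sqrt{x/\mathrm{df}(\lambda)}$ in Eq. (28) and (29), we get that for every $x \ge 1$, on $\Omega_x$, for every $\lambda \in \Lambda$,

\[ \mathrm{crit}_C(\lambda) \ge \frac{b(\lambda)}{3} + 2(C - \sigma^2)\,\mathrm{df}(\lambda) - \left(9\sqrt{x\,\mathrm{df}(\lambda)} + 12x\right)\sigma^2 \tag{30} \]

\[ \mathrm{crit}_C(\lambda) \le \frac{5b(\lambda)}{3} + (C - \sigma^2)\,\mathrm{df}(\lambda) + \left(9\sqrt{x\,\mathrm{df}(\lambda)} + 12x\right)\sigma^2 . \tag{31} \]

Let $\lambda \in \Lambda$. Two cases can be distinguished:

1. If $\mathrm{df}(\lambda) < a_n/5$, then Eq. (30) implies

\[ \mathrm{crit}_C(\lambda) \ge \frac{2(C - \sigma^2)a_n}{5} - \left(9\sqrt{\frac{x a_n}{5}} + 12x\right)\sigma^2 . \tag{32} \]

2. If $\mathrm{df}(\lambda) \ge a_n$, then Eq. (31) implies

\[ \mathrm{crit}_C(\lambda) \le \frac{5b(\lambda)}{3} + (C - \sigma^2)a_n + \left(9\sqrt{x a_n} + 12x\right)\sigma^2 . \tag{33} \]

We now take $x = \gamma\ln(n)$ so that $P(\Omega_x) \ge 1 - 8\,\mathrm{Card}(\Lambda)n^{-\gamma}$.

On the one hand, Eq. (32) implies

\[ \inf_{\lambda \in \Lambda,\ \mathrm{df}(\lambda) < a_n/5} \left\{ \mathrm{crit}_C(\lambda) \right\} \ge \frac{2(C - \sigma^2)a_n}{5} - \left(9\sqrt{\frac{\gamma\ln(n)a_n}{5}} + 12\gamma\ln(n)\right)\sigma^2 . \tag{34} \]

On the other hand, for $\lambda = \lambda_1$ given by assumption (A'_1), Eq. (33) implies

\[ \mathrm{crit}_C(\lambda_1) \le \frac{5\beta_n\sigma^2}{3} + (C - \sigma^2)a_n + \left(9\sqrt{\gamma a_n\ln(n)} + 12\gamma\ln(n)\right)\sigma^2 . \tag{35} \]

Comparing Eq. (34) and Eq. (35), we get that

\[ \mathrm{crit}_C(\lambda_1) < \inf_{\lambda \in \Lambda,\ \mathrm{df}(\lambda) < a_n/5} \left\{ \mathrm{crit}_C(\lambda) \right\} , \]

hence $\mathrm{df}(\widehat{\lambda}_0(C)) \ge a_n/5$, if

\[ 1 - \sigma^{-2}C > \frac{3\beta_n}{a_n} + 62\gamma\sqrt{\frac{\ln(n)}{a_n}} . \]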



B.3 Above the minimal penalty 

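We assume in this subsection that $C > \sigma^2$. We will prove Eq. (27) using assumption (A'_2), since when $\alpha = 1/2$, $\beta_n = \sqrt{n\ln(n)}$ and $\beta = (1+\alpha)/2 = 3/4$, Eq. (27) is Eq. (25) and (A'_2) is (A_2).

Using Eq. (22) and taking $\theta = \sqrt{x/\mathrm{df}(\lambda)}$ in Eq. (28) and (29), we get that for every $x \ge 1$, on $\Omega_x$, for every $\lambda \in \Lambda$,

\[ \mathrm{crit}_C(\lambda) \ge \frac{b(\lambda)}{3} + (C - \sigma^2)\,\mathrm{df}(\lambda) - \left(9\sqrt{x\,\mathrm{df}(\lambda)} + 12x\right)\sigma^2 \tag{36} \]

\[ \mathrm{crit}_C(\lambda) \le \frac{5b(\lambda)}{3} + 2(C - \sigma^2)\,\mathrm{df}(\lambda) + \left(9\sqrt{x\,\mathrm{df}(\lambda)} + 12x\right)\sigma^2 . \tag{37} \]

Let $\lambda \in \Lambda$, and $\beta \in (\alpha, 1)$. As in Section B.2, we consider two cases.

1. If $\mathrm{df}(\lambda) \le n^{\alpha}$, Eq. (37) implies

\[ \mathrm{crit}_C(\lambda) \le 2b(\lambda) + 2(C - \sigma^2)n^{\alpha} + \left(9\sqrt{x n^{\alpha}} + 12x\right)\sigma^2 . \tag{38} \]

2. If $\mathrm{df}(\lambda) \ge n^{\beta}$, Eq. (36) implies

\[ \mathrm{crit}_C(\lambda) \ge (C - \sigma^2)n^{\beta} - \left(9\sqrt{x n^{\beta}} + 12x\right)\sigma^2 . \tag{39} \]

We now take $x = \gamma\ln(n)$ as in Section B.2.

On the one hand, for $\lambda = \lambda_2$ given by assumption (A'_2), Eq. (38) implies

\[ \mathrm{crit}_C(\lambda_2) \le 2\sigma^2\beta_n + (C - \sigma^2)\frac{n^{\beta}}{2} + \left(9\sqrt{\gamma\ln(n)n^{\alpha}} + 12\gamma\ln(n)\right)\sigma^2 \tag{40} \]

if $n^{\beta-\alpha} \ge 4$.

On the other hand, Eq. (39) implies

\[ \inf_{\lambda \in \Lambda,\ \mathrm{df}(\lambda) \ge n^{\beta}} \left\{ \mathrm{crit}_C(\lambda) \right\} \ge (C - \sigma^2)n^{\beta} - \left(9\sqrt{\gamma\ln(n)n^{\beta}} + 12\gamma\ln(n)\right)\sigma^2 . \tag{41} \]

Comparing Eq. (40) and Eq. (41), we get that

\[ \mathrm{crit}_C(\lambda_2) < \inf_{\lambda \in \Lambda,\ \mathrm{df}(\lambda) \ge n^{\beta}} \left\{ \mathrm{crit}_C(\lambda) \right\} , \]

hence $\mathrm{df}(\widehat{\lambda}_0(C)) \le n^{\beta}$, if

\[ n \ge 4^{1/(\beta-\alpha)} , \qquad \sqrt{\frac{n}{\ln(n)}} \ge 12 , \qquad \text{and} \qquad \sigma^{-2}C - 1 > 44\gamma\beta_n n^{-\beta} . \]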
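The final condition can be checked by direct arithmetic. The following is a sketch of that computation, assuming that the two bounds being compared are $2\sigma^2\beta_n + (C-\sigma^2)n^{\beta}/2 + (9\sqrt{\gamma\ln(n)n^{\alpha}} + 12\gamma\ln(n))\sigma^2$ (upper bound at $\lambda_2$) and $(C-\sigma^2)n^{\beta} - (9\sqrt{\gamma\ln(n)n^{\beta}} + 12\gamma\ln(n))\sigma^2$ (lower bound over $\mathrm{df}(\lambda) \ge n^{\beta}$), and using only $\gamma \ge 1$, $n^{\alpha} \le n^{\beta} \le n$ and $\beta_n \ge \sqrt{n\ln(n)}$.

```latex
\begin{align*}
&\text{The strict comparison is equivalent to}\quad
 (\sigma^{-2}C-1)\,\frac{n^{\beta}}{2}
   > 2\beta_n + 9\sqrt{\gamma\ln(n)}\left(n^{\alpha/2}+n^{\beta/2}\right) + 24\gamma\ln(n) . \\
&\text{Since } \gamma \ge 1 \text{ and } n^{\alpha/2} \le n^{\beta/2} \le \sqrt{n}: \quad
 9\sqrt{\gamma\ln(n)}\left(n^{\alpha/2}+n^{\beta/2}\right)
   \le 18\gamma\sqrt{n\ln(n)} \le 18\gamma\beta_n . \\
&\text{When } \sqrt{n/\ln(n)} \ge 12: \quad
 24\gamma\ln(n) = 24\gamma\sqrt{\frac{\ln(n)}{n}}\,\sqrt{n\ln(n)} \le 2\gamma\beta_n . \\
&\text{Hence it suffices that}\quad
 (\sigma^{-2}C-1)\,\frac{n^{\beta}}{2} > (2+18+2)\,\gamma\beta_n ,
 \quad\text{that is,}\quad \sigma^{-2}C-1 > 44\gamma\beta_n n^{-\beta} .
\end{align*}
```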
C Oracle inequality



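Define

\[ \forall C > 0, \qquad \widehat{\lambda}_{\mathrm{opt}}(C) \in \arg\min_{\lambda \in \Lambda} \left\{ \|\widehat{F}_\lambda - Y\|_2^2 + 2C\,\mathrm{tr}(A_\lambda) \right\} . \tag{42} \]

We will prove the following proposition in this section.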



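Proposition 2. Let $\widehat{\lambda}_{\mathrm{opt}}$ be defined by Eq. (42). Assume that $\forall \lambda \in \Lambda$, $A_\lambda$ is symmetric with $\mathrm{Sp}(A_\lambda) \subset [0,1]$, and that the $\varepsilon_i$ are i.i.d. Gaussian with zero mean and variance $\sigma^2 > 0$.

Then, a numerical constant $C_2 > 0$ exists such that for every $n \ge C_2$, $\gamma \ge 1$, $\eta_+ \ge (\ln(n))^{-1}$, and $C > 0$ such that $\sigma^{-2}C \in \left[1 + (\ln(n))^{-1}, 1 + \eta_+\right]$,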



n 



Aopt(C) 



< 1 + 



ln(n) 



inf 1, n 1 

AeA 



F\ — F 



+ 4n~ 



V tr(4> 



n 



2 „2 



+ 



14 7 (ln(n)) a 



n 



(43) 



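holds with probability at least $1 - 8\,\mathrm{Card}(\Lambda)n^{-\gamma}$.

If in addition

\[ \exists \kappa \ge 1, \ \forall \lambda \in \Lambda, \qquad v_1(\lambda) \le \kappa\left(v_2(\lambda) + b(\lambda) + (\ln(n))^2\sigma^2\right) \tag{$A'_3$} \]

then a constant $C_3 > 0$ depending only on $\kappa$ exists such that for every $n \ge C_3$, $\gamma \ge 1$, and $C > 0$ such that $\sigma^{-2}C \in \left[1 - (\ln(n))^{-1}, 1 + (\ln(n))^{-1}\right]$,

\[ n^{-1}\left\|\widehat{F}_{\widehat{\lambda}_{\mathrm{opt}}(C)} - F\right\|_2^2 \le \left(1 + \frac{40\kappa}{\ln(n)}\right) \inf_{\lambda \in \Lambda} \left\{ n^{-1}\left\|\widehat{F}_\lambda - F\right\|_2^2 \right\} + \frac{36(\kappa + \gamma)\ln(n)\sigma^2}{n} \tag{44} \]

holds with probability at least $1 - 8\,\mathrm{Card}(\Lambda)n^{-\gamma}$.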



If $\mathrm{Card}(\Lambda) \le K n^{\alpha}$, Proposition 2 with $\gamma = \alpha + 2$ proves that with probability at least $1 - 8Kn^{-2}$, $\widehat{\lambda}$ defined in the Algorithm of Section 4.1 satisfies an oracle inequality if assumption (A'_3) holds.
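As an illustration of the selection rule of Eq. (42), the sketch below computes $\widehat{\lambda}_{\mathrm{opt}}(C)$ over a grid and compares its risk with the oracle risk on simulated data. The toy problem (eigenvalues $j^{-2}$, signal coordinates, grid, seed) and the function names are assumptions made for this example, and the true noise variance is used in place of an estimate, so this is not the full algorithm of Section 4.1.

```python
import math
import random

def risks_and_criteria(n=200, sigma=1.0, seed=1):
    """Toy kernel-ridge-like family in the eigenbasis of K (all inputs are assumptions)."""
    rng = random.Random(seed)
    mu = [j ** -2.0 for j in range(1, n + 1)]                # assumed eigenvalues of K
    f = [math.sqrt(n) * j ** -1.5 for j in range(1, n + 1)]  # assumed coordinates of F
    y = [fj + sigma * rng.gauss(0.0, 1.0) for fj in f]
    rows = []
    for k in range(0, 41):
        lam = 10.0 ** (-k / 4.0)
        a = [m / (m + n * lam) for m in mu]                  # diagonal of A_lambda
        fit = sum(((aj - 1.0) * yj) ** 2 for aj, yj in zip(a, y))           # ||F_hat - Y||^2
        risk = sum((aj * yj - fj) ** 2 for aj, yj, fj in zip(a, y, f)) / n  # ||F_hat - F||^2 / n
        rows.append({"lam": lam, "fit": fit, "risk": risk, "df": sum(a)})
    return rows

def select(rows, C):
    """lambda_opt(C): minimize ||F_hat - Y||^2 + 2 C tr(A_lambda) over the grid."""
    return min(rows, key=lambda r: r["fit"] + 2.0 * C * r["df"])

rows = risks_and_criteria()
oracle = min(rows, key=lambda r: r["risk"])   # not available in practice (needs F)
for C in (0.5, 1.0, 1.2):                     # the noise variance is 1 here
    chosen = select(rows, C)
    print(f"C = {C:3.1f}   df = {chosen['df']:6.2f}   "
          f"risk / oracle risk = {chosen['risk'] / oracle['risk']:.3f}")
```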



Remark 4. Assumption (A'_3) holds as soon as (A_3) holds, i.e.,

\[ n^{-1}\,E\left[\left\|\widehat{F}_\lambda - F\right\|_2^2\right] = n^{-1}\left(v_2(\lambda) + b(\lambda)\right) \ge \kappa^{-1}\sigma^2\,\mathrm{tr}(A_\lambda)\,n^{-1} , \]

which is the parametric rate of estimation in a model of dimension $\mathrm{tr}(A_\lambda)$.

In the ordinary least-squares regression example, where all $A_\lambda$ are projection matrices, assumption (A_3) always holds with $\kappa = 1$ because $v_1(\lambda) = v_2(\lambda)$.

In the kernel ridge regression example, a sufficient condition for (A'_3) is that the eigenvalues $(\mu_j)_{1 \le j \le n}$ of the kernel matrix $K$ satisfy

\[ \exists \alpha, L_1, L_2 > 0, \ \forall 1 \le j \le n, \qquad L_1 j^{-\alpha} \le \mu_j \le L_2 j^{-\alpha} , \]

which is a classical assumption in kernel ridge regression with a random design; see Section E.2 for details.
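Remark 5. When $\mathrm{tr}(A_\lambda^\top A_\lambda) \ll \mathrm{tr}(A_\lambda)$ for most $\lambda \in \Lambda$, taking $C$ slightly larger than $\sigma^2$ as in the first part of Proposition 2 is necessary to obtain an oracle inequality. Indeed, Proposition 1 then shows that

\[ \left(2\,\mathrm{tr}(A_\lambda) - \mathrm{tr}(A_\lambda^\top A_\lambda)\right)\sigma^2 n^{-1} \approx 2\,\mathrm{tr}(A_\lambda)\sigma^2 n^{-1} \]

is a minimal penalty. So, any underestimation of the constant $C$ in the penalty $2C\,\mathrm{tr}(A_\lambda)n^{-1}$ may lead to selecting $\widehat{\lambda} = \widehat{\lambda}_{\mathrm{opt}}(C)$ with $\mathrm{df}(\widehat{\lambda}) \ge n/\ln(n)$.

Such a phenomenon holds for instance when $A_\lambda = \lambda I_n$ and $\Lambda \subset [0,1]$, since $\mathrm{tr}(A_\lambda^\top A_\lambda) = \mathrm{tr}(A_\lambda)^2 n^{-1} \ll \mathrm{tr}(A_\lambda)$ unless $\mathrm{tr}(A_\lambda) \propto n$.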

Remark 6. The remainder terms in Eq. (43) and (44), $14\gamma(\ln(n))^2\sigma^2 n^{-1}$ and $36(\kappa+\gamma)\ln(n)\sigma^2 n^{-1}$, are negligible compared with the risk of the oracle provided that $v_2(\lambda^\star)$ grows with $n$ faster than $(\ln(n))^2$, since the risk of $\widehat{F}_{\lambda^\star}$ is at least of order $v_2(\lambda^\star)n^{-1}$. This usually holds when the bias is not exactly zero for some $\lambda \in \Lambda$ with $\mathrm{tr}(A_\lambda^\top A_\lambda)$ too small, as often assumed in the model selection literature for proving asymptotic optimality results.

Let us now prove Proposition 2.



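C.1 General starting point

Combining Eq. (18) and (42), we obtain that for every $C > 0$ such that $\sigma^{-2}C \in [1 - \eta_-, 1 + \eta_+]$,

\[ \left\|\widehat{F}_{\widehat{\lambda}_{\mathrm{opt}}(C)} - F\right\|_2^2 - 2\eta_- v_1(\widehat{\lambda}_{\mathrm{opt}}(C)) + \Delta(\widehat{\lambda}_{\mathrm{opt}}(C)) \le \inf_{\lambda \in \Lambda} \left\{ \left\|\widehat{F}_\lambda - F\right\|_2^2 + 2\eta_+ v_1(\lambda) + \Delta(\lambda) \right\} \tag{45} \]

where

\[ \Delta(\lambda) := -2\delta_1(\lambda) + \delta_4(\lambda) . \]

Inequality (45) implies an oracle inequality as soon as $\Delta(\lambda)$ is small compared to $\|\widehat{F}_\lambda - F\|_2^2$ and $\eta_-, \eta_+$ are small enough.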



C.2 Make use of concentration inequalities 

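Let $\Omega_x$ denote the same event as in Section B.1. From Eq. (54) and (61), since $\|A_\lambda\| \le 1$, we deduce that on $\Omega_x$

\[ \forall \lambda \in \Lambda, \ \forall \theta > 0, \qquad |\Delta(\lambda)| \le \theta b(\lambda) + 2\theta v_1(\lambda) + (4 + 5\theta^{-1})x\sigma^2 . \tag{46} \]

In addition, combining Eq. (17), (53) and (62) with $\theta = 1/2$, and $\|A_\lambda\| \le 1$, we have on $\Omega_x$

\[ \forall \lambda \in \Lambda, \qquad b(\lambda) + v_2(\lambda) \le 2\left\|\widehat{F}_\lambda - F\right\|_2^2 + 16x\sigma^2 . \tag{47} \]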



C.3 First result: with a slightly enlarged penalty 

Assume in this subsection that $\sigma^{-2}C \in \left[1 + (\ln(n))^{-1} ; 1 + \eta_+\right]$ with $\eta_+ \ge (\ln(n))^{-1}$. Then, Eq. (45) and (46) with $\theta = (\ln(n))^{-1}$ imply



F~ — F 

Aopt (C) 



2 1 - (ln(n)) -1 AgA 



F\ — F 



+ 4r ] + v 1 (X) } + (9 + 121n(n))x<7 2 , (48) 
if $\ln(n) \ge 5$.

Taking $x = \gamma\ln(n)$ with $\gamma \ge 1$, we get $P(\Omega_x) \ge 1 - 8\,\mathrm{Card}(\Lambda)n^{-\gamma}$, and Eq. (48) implies Eq. (43) for every $n \ge C_2 = e^5$.



C.4 Second result: with assumption (A3) 



Assume in this subsection that $\sigma^{-2}C \in [1 - \eta_- ; 1 + \eta_+]$ with $0 \le \eta_-, \eta_+ \le (\ln(n))^{-1}$, and that (A'_3) holds.

Then, Eq. (46) with $\theta = (\ln(n))^{-1}$ and Eq. (47) imply

2 



> 1 



Aopt(C) 

ln(n) 



F 



277-^1 (Aopt(C)) + A(Aopt(C)) 





2 


F~ — F 

Aopt(C) 


2 



80k , 

4 + — - x + 9\a(n)n 
m(n) 



a 



(49) 



and for every A 6 A, 



Fx -F 



+ 2r ] +v 1 (X) + A(X) 



< 1 + 



10k 
ln(n) 



Fx-F 



+ 



80k , 

4 + I x + 91n(n)K 



ln(n) 



Combining Eq. (45), (49) and (50), we get that on $\Omega_x$,

2 



F~ — F 

A op t (C) 



2 / 40k 
2 \ m(nj 



F 



+ 4 



4 + 



80k 
ln(n) 



x + 9 ln(n)K 



(50) 



(51) 



if $\ln(n) \ge 20\kappa$.

Now, taking $x = \gamma\ln(n)$ with $\gamma \ge 1$ in Eq. (51) implies Eq. (44) for every $n \ge C_3(\kappa)$. ■

D Concentration inequalities 

The concentration inequalities used for proving Propositions 1 and 2 are proved in this section.

D.1 Linear functions of ε

We here prove concentration inequalities for $\delta_3(\lambda)$ and $\delta_4(\lambda)$. Let us first prove a classical result.

Proposition 3. Let $\xi$ be a standard Gaussian vector in $\mathbb{R}^n$, $a \in \mathbb{R}^n$ and

\[ Z = \langle \xi, a \rangle = \sum_{i=1}^{n} a_i \xi_i . \]

Then, for every $x \ge 0$,

\[ P\left( |Z| \le \sqrt{2x}\,\|a\|_2 \right) \ge 1 - 2e^{-x} . \tag{52} \]

Proof. $Z$ is a Lipschitz function of $\xi$, with Lipschitz constant $\|a\|_2$. Therefore, the Gaussian concentration theorem implies (see for instance Theorem 3.4 in [15]):

\[ \forall t \ge 0, \qquad P\left( |Z| \ge t \right) \le 2\exp\left( -\frac{t^2}{2\|a\|_2^2} \right) . \]

The result follows by taking $t = \sqrt{2x}\,\|a\|_2$. □
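Inequality (52) is easy to check by simulation. The following sketch estimates the probability on the left-hand side by Monte Carlo for one arbitrary vector $a$ and compares it with $1 - 2e^{-x}$; the vector, the number of trials, the seed and the function name `empirical_coverage` are choices made here for illustration only.

```python
import math
import random

def empirical_coverage(a, x, trials=5000, seed=0):
    """Monte Carlo estimate of P(|<xi, a>| <= sqrt(2x) * ||a||_2), xi standard Gaussian."""
    rng = random.Random(seed)
    threshold = math.sqrt(2.0 * x) * math.sqrt(sum(ai * ai for ai in a))
    hits = 0
    for _ in range(trials):
        z = sum(ai * rng.gauss(0.0, 1.0) for ai in a)
        hits += abs(z) <= threshold
    return hits / trials

a = [3.0, -1.0, 0.5, 2.0]  # arbitrary vector chosen for illustration
for x in (0.5, 1.0, 2.0, 4.0):
    lower_bound = max(0.0, 1.0 - 2.0 * math.exp(-x))
    print(f"x = {x:3.1f}   empirical = {empirical_coverage(a, x):.4f}   bound = {lower_bound:.4f}")
```

Since $Z$ is exactly Gaussian here, the bound is not tight: the empirical frequencies should sit noticeably above $1 - 2e^{-x}$.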
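Now, remark that

\[ \delta_3(\lambda) = \left\langle \sigma^{-1}\varepsilon,\ 2\sigma A_\lambda^\top (I_n - A_\lambda)F \right\rangle \quad\text{and}\quad \delta_4(\lambda) = \left\langle \sigma^{-1}\varepsilon,\ 2\sigma (I_n - A_\lambda)F \right\rangle , \]

where $\sigma^{-1}\varepsilon$ is a standard Gaussian vector. Hence, Proposition 3 shows that for every $x \ge 0$ and $\lambda \in \Lambda$,

\[ P\left( |\delta_3(\lambda)| \le 2\sigma\sqrt{x}\,\left\|A_\lambda^\top (I_n - A_\lambda)F\right\|_2 \right) \ge 1 - 2e^{-x} \]
\[ P\left( |\delta_4(\lambda)| \le 2\sigma\sqrt{x}\,\left\|(I_n - A_\lambda)F\right\|_2 \right) \ge 1 - 2e^{-x} , \]

which implies that

\[ P\left( \forall \theta > 0, \ |\delta_3(\lambda)| \le \theta^{-1}x\|A_\lambda\|^2\sigma^2 + \theta\left\|(I_n - A_\lambda)F\right\|_2^2 \right) \ge 1 - 2e^{-x} \tag{53} \]

\[ P\left( \forall \theta > 0, \ |\delta_4(\lambda)| \le \theta^{-1}x\sigma^2 + \theta\left\|(I_n - A_\lambda)F\right\|_2^2 \right) \ge 1 - 2e^{-x} , \tag{54} \]

since $\forall a, b, \theta > 0$, $2ab \le \theta a^2 + \theta^{-1}b^2$.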
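D.2 Quadratic functions of ε

We here prove concentration inequalities for $\delta_2(\lambda)$ and $\delta_1(\lambda)$. Let us first prove (recall) a general result.

Proposition 4. Let $\xi$ be a standard Gaussian vector in $\mathbb{R}^n$, $M$ a real-valued $n \times n$ matrix and

\[ Z = \|M\xi\|_2^2 - \mathrm{tr}(M^\top M) . \]

Then, for every $x \ge 0$,

\[ P\left( \forall \theta > 0, \ Z \le \theta\,\mathrm{tr}(M^\top M) + 2(1 + \theta^{-1})\|M\|^2 x \right) \ge 1 - e^{-x} \tag{55} \]

\[ P\left( \forall \theta > 0, \ Z \ge -\theta\,\mathrm{tr}(M^\top M) - \left[2x(\theta^{-1} - 1) + 1 - \theta\right]\|M\|^2 \right) \ge 1 - e^{-x} . \tag{56} \]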



Proof. Define $W = \|M\xi\|_2$, and note that $E[W^2] = \mathrm{tr}(M^\top M)$. Since $W$ is a Lipschitz function of $\xi$ with Lipschitz constant $\|M\|$, the Gaussian concentration theorem (see for instance Theorem 3.4 in [15]) shows that for every $x \ge 0$, an event $\Omega_+$ of probability at least $1 - \exp(-x)$ exists on which

\[ W \le E[W] + \sqrt{2x}\,\|M\| , \tag{57} \]

and an event $\Omega_-$ of probability at least $1 - \exp(-x)$ exists on which

\[ W \ge E[W] - \sqrt{2x}\,\|M\| . \tag{58} \]

In addition, Proposition 3.5 in [15] shows that $\mathrm{var}(W) \le \|M\|^2$. Therefore,

\[ 0 \le E[W^2] - (E[W])^2 = \mathrm{var}(W) \le \|M\|^2 . \tag{59} \]

We now combine Eq. (59) with the two concentration inequalities above for proving the result. On the one hand, on $\Omega_+$,

\[ \begin{aligned} W^2 &\le (E[W])^2 + 2E[W]\sqrt{2x}\,\|M\| + 2x\|M\|^2 \\ &\le E[W^2] + 2\sqrt{2x\,E[W^2]}\,\|M\| + 2x\|M\|^2 \\ &\le (1 + \theta)E[W^2] + 2(1 + \theta^{-1})x\|M\|^2 \end{aligned} \]

for every $\theta > 0$, using successively Eq. (59) and that $\forall a, b, \theta > 0$, $2\sqrt{ab} \le a\theta + b\theta^{-1}$. This proves Eq. (55).

On the other hand, for every $x \ge 0$ such that $x < (E[W^2] - \|M\|^2)/(2\|M\|^2)$, on $\Omega_-$,

\[ \begin{aligned} W^2 &\ge \left( \sqrt{E[W^2] - \|M\|^2} - \sqrt{2x}\,\|M\| \right)^2 \\ &\ge (1 - \theta)E[W^2] - \left[2x(\theta^{-1} - 1) + 1 - \theta\right]\|M\|^2 . \end{aligned} \tag{60} \]

This proves Eq. (56), since the lower bound in Eq. (60) is non-positive if $x \ge (E[W^2] - \|M\|^2)/(2\|M\|^2)$. □
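The upper deviation bound (55) can also be checked numerically. The sketch below does so for a diagonal matrix $M$ and a few fixed values of $\theta$, which is weaker than the statement above (where the inequality holds simultaneously for all $\theta > 0$) but enough to see the orders of magnitude. The diagonal entries, the number of trials, the seed and the function name are arbitrary choices made for this illustration.

```python
import math
import random

def upper_bound_frequency(m, x, theta, trials=5000, seed=0):
    """Fraction of draws with Z <= theta*tr(M^T M) + 2(1 + 1/theta)*||M||^2*x, M = diag(m)."""
    rng = random.Random(seed)
    trace = sum(mi * mi for mi in m)       # tr(M^T M) for a diagonal matrix
    op_norm_sq = max(mi * mi for mi in m)  # squared operator norm of diag(m)
    bound = theta * trace + 2.0 * (1.0 + 1.0 / theta) * op_norm_sq * x
    hits = 0
    for _ in range(trials):
        z = sum((mi * rng.gauss(0.0, 1.0)) ** 2 for mi in m) - trace
        hits += z <= bound
    return hits / trials

m = [1.0, 0.8, 0.5, 0.3, 0.1]  # arbitrary diagonal entries chosen for illustration
for x in (0.5, 1.0, 2.0):
    for theta in (0.5, 1.0, 2.0):
        freq = upper_bound_frequency(m, x, theta)
        print(f"x = {x:3.1f}  theta = {theta:3.1f}  "
              f"frequency = {freq:.4f}  bound = {1.0 - math.exp(-x):.4f}")
```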

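Now, remark that if $B_\lambda$ exists such that $A_\lambda = B_\lambda^\top B_\lambda$ (as in the kernel ridge regression example for instance, and more generally when $A_\lambda$ is symmetric real-valued with $\mathrm{Sp}(A_\lambda) \subset [0,1]$), then

\[ \sigma^{-2}\delta_1(\lambda) = \left\|B_\lambda(\sigma^{-1}\varepsilon)\right\|_2^2 - \mathrm{tr}(B_\lambda^\top B_\lambda) \quad\text{and}\quad \sigma^{-2}\delta_2(\lambda) = \left\|A_\lambda(\sigma^{-1}\varepsilon)\right\|_2^2 - \mathrm{tr}(A_\lambda^\top A_\lambda) . \]

Hence, Proposition 4 shows that for every $x \ge 0$ and $\lambda \in \Lambda$,

\[ P\left( \forall \theta > 0, \ |\delta_1(\lambda)| \le \theta\sigma^2\,\mathrm{tr}(A_\lambda) + 2(1 + \theta^{-1})x\|A_\lambda\|\sigma^2 \right) \ge 1 - 2e^{-x} \tag{61} \]

\[ P\left( \forall \theta > 0, \ |\delta_2(\lambda)| \le \theta\sigma^2\,\mathrm{tr}(A_\lambda^\top A_\lambda) + 2(1 + \theta^{-1})x\|A_\lambda\|^2\sigma^2 \right) \ge 1 - 2e^{-x} , \tag{62} \]

where we used in particular that $\|B_\lambda\|^2 = \|A_\lambda\|$.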

E Kernel ridge regression example 

Finally, let us make some computations that are specific to the kernel ridge regression example. 
E.1 Explicit formulas for the deterministic terms

Let $K$ be the $n \times n$ matrix such that $K_{ij} = k(x_i, x_j)$. Then, the kernel regression estimator with regularization parameter $\lambda$ is defined by

\[ \widehat{F}_\lambda = A_\lambda Y \quad\text{with}\quad A_\lambda = K(K + n\lambda I_n)^{-1} . \]

Then, $A_\lambda$ is symmetric, real-valued (hence diagonalizable by orthogonal matrices) and $\mathrm{Sp}(A_\lambda) \subset [0,1]$.

Let $(e_j)_{1 \le j \le n}$ be the (orthonormal) eigenvectors of $K$, with eigenvalues $(\mu_j)_{1 \le j \le n}$, assuming that $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_n \ge 0$. We also assume that $\mu_1 > 0$, that is, $K$ is not the null matrix. We can then decompose $F$ in this basis: $F = \sum_{j=1}^{n} f_j e_j$.

Therefore, in the orthonormal basis $(e_j)_{1 \le j \le n}$, $A_\lambda$ is diagonal with coefficients

\[ \frac{\mu_j}{\mu_j + n\lambda} , \qquad 1 \le j \le n . \]
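Hence,

\[ \mathrm{tr}(A_\lambda) = \mathrm{df}(\lambda) = \sum_{j=1}^{n} \frac{\mu_j}{\mu_j + n\lambda} \]

\[ \mathrm{tr}(A_\lambda^\top A_\lambda) = \sum_{j=1}^{n} \left( \frac{\mu_j}{\mu_j + n\lambda} \right)^2 \]

\[ 2\,\mathrm{tr}(A_\lambda) - \mathrm{tr}(A_\lambda^\top A_\lambda) = \sum_{j=1}^{n} \left[ \frac{2\mu_j}{\mu_j + n\lambda} - \left( \frac{\mu_j}{\mu_j + n\lambda} \right)^2 \right] = \sum_{j=1}^{n} \frac{\mu_j(\mu_j + 2n\lambda)}{(\mu_j + n\lambda)^2} \]

\[ b(\lambda) = \left\|(A_\lambda - I_n)F\right\|_2^2 = \sum_{j=1}^{n} \left( 1 - \frac{\mu_j}{\mu_j + n\lambda} \right)^2 f_j^2 . \]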



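Note that $\mathrm{df}(\lambda)$ and $\mathrm{tr}(A_\lambda^\top A_\lambda)$ are decreasing functions of $\lambda$, as well as $2\,\mathrm{tr}(A_\lambda) - \mathrm{tr}(A_\lambda^\top A_\lambda)$ since each term of the sum is nonincreasing, and at least one is decreasing. On the contrary, $b(\lambda)$ is a nondecreasing function of $\lambda$. Hence, $\mathrm{tr}(A_\lambda^\top A_\lambda)$ and $2\,\mathrm{tr}(A_\lambda) - \mathrm{tr}(A_\lambda^\top A_\lambda)$ are increasing functions of $\mathrm{df}(\lambda)$, and $b(\lambda)$ is a nonincreasing function of $\mathrm{df}(\lambda)$.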
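These closed-form expressions only require the eigenvalues $\mu_j$ of $K$ and the coordinates $f_j$ of $F$, so they are cheap to evaluate for many values of $\lambda$. The sketch below implements them and prints the four quantities along a grid; the function name `krr_terms`, the eigenvalues $j^{-2}$ and the coordinates $1/j$ are assumptions chosen for illustration.

```python
def krr_terms(mu, f, lam):
    """Return (df, tr(A^T A), 2 tr(A) - tr(A^T A), b) for A = K (K + n*lam*I)^{-1}."""
    n = len(mu)
    a = [m / (m + n * lam) for m in mu]        # eigenvalues of A_lambda
    df = sum(a)
    tr_ata = sum(aj * aj for aj in a)
    pen_shape = sum(m * (m + 2.0 * n * lam) / (m + n * lam) ** 2 for m in mu)
    bias = sum((1.0 - aj) ** 2 * fj * fj for aj, fj in zip(a, f))
    return df, tr_ata, pen_shape, bias

n = 50
mu = [j ** -2.0 for j in range(1, n + 1)]  # assumed eigenvalues of K
f = [1.0 / j for j in range(1, n + 1)]     # assumed coordinates of F
for lam in (1e-6, 1e-4, 1e-2, 1.0):
    df, tr_ata, pen_shape, bias = krr_terms(mu, f, lam)
    print(f"lambda = {lam:7.0e}  df = {df:7.3f}  tr(A^T A) = {tr_ata:7.3f}  "
          f"2tr(A) - tr(A^T A) = {pen_shape:7.3f}  b = {bias:.5f}")
```

Along this grid the first three columns should decrease and the last one should increase, in line with the monotonicity statements above.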



E.2 Sufficient condition for assumption (A' 3 ) 



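Assumption (A'_3) holds in particular when

\[ \exists \kappa \ge 1, \ \forall \lambda \in \Lambda, \qquad \mathrm{tr}(A_\lambda) \le \kappa\,\mathrm{tr}(A_\lambda^\top A_\lambda) . \]

If the eigenvalues of $K$ satisfy

\[ \exists \alpha, L_1, L_2 > 0, \ \forall 1 \le j \le n, \qquad L_1 j^{-\alpha} \le \mu_j \le L_2 j^{-\alpha} \]

then, following [9],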



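\[ \mathrm{tr}(A_\lambda) \le \sum_{j=1}^{n} \frac{L_2 j^{-\alpha}}{L_2 j^{-\alpha} + n\lambda} = \sum_{j=1}^{n} \frac{1}{1 + n\lambda L_2^{-1} j^{\alpha}} \le \int_0^{\infty} \frac{dt}{1 + n\lambda L_2^{-1} t^{\alpha}} = \left( \frac{L_2}{n\lambda} \right)^{1/\alpha} \int_0^{\infty} \frac{du}{1 + u^{\alpha}} \]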



and 



tr(A T x A x ) > £ 



Lif 



> 



Therefore, (A3) holds with 



dt 

(l + nXL^H a ) 



U\ 1/a r°° du 



y i 



nA 



du 



1 ( 1 + u a ) 



2 ■ 



Li J Jo l + u a \ J 1 (l + u a ) 



du 
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